This article was developed based on the final project for Social Media Course at Business Analytics Master Course at The University of Hong Kong. The project is a result of team effort and everybody worked really hard to deliver all analysis that are going to be presented later on. I would like to say thank you for all my team mates: August Hjorth, Maria Raquel Gomez Lopez, Martina de Luca, Xiaoyen Chen and Mawuli Adjei. :)
Later, me and Mawuli decided to use the project content to publish some articles that could be useful for other persons that are planning to develop similar studies. This will be the first of five different articles based in that project:
- Summary report: In this article we are focusing on insights obtained through the analysis and its application to business. It works as a summary, without go too deep into technical aspects. Since we are students at Business Analytics Course, we decided that would be interesting to show a little bit how analytics tools can be applied to give insights to business and support data drive strategies.
- Topic Modeling: In this article we are going to show in more detail how we develop the topic modeling. It will be more technical and focused on the step by step of our R-script.
- User Cluster: Similar to the Topic Modeling article, in this one we are going to focus on the User Cluster Development.
- Sentiment Analysis: Again, this will be more technical article, focused both on the sentiment analysis and agreement analysis.
- Social Networks Analysis with Gephi: In this article we are going to explore basic Social Network concepts as well give a brief introduction to Gephi.
The online newspaper sector is one of the most vulnerable to drastic changes especially due to the emergence of new online services and strategies. The challenges are growing , not only in terms of technology developments and distribution channels, but also due the presence of new competitors: social networks. Physical sales are decreasing and new online and social strategies are needed to attract an audience online and to increase ad revenue.
Now, in order to have a good performance in terms of profit and engagement it’s necessary to be active on many platforms. We decided to analyse the South China Morning Post due to their particular position which covers printed newspapers, an active Facebook presence, Instagram and Twitter account. Even though the newspapers sector is now facing many challenges, this journal is one of the most competitive in the world.
The South China Morning Post was established in Hong Kong in 1903 by Tse Tsan-Tai and Alfred Cunningham and published its first newspaper in English on November 6th 1900. 115 years later it is one of the most important news portal in Hong Kong and was acquired by Alibaba in 2015. It set out to be a global media company that reports mainly about Hong Kong and mainland China to an international audience.
SCMP produces content that covers multiple topics such as business, technology, and lifestyle. In order to achieve its mission of becoming the bridge of communication between Asia and the world, SCMP went through a series of transformations in 2018 that has allowed them to reach a broader audience.
In this project we analyzed the Twitter data from SCMP aiming to understand its Social Media Network. Our goal was to understand how the Twitter Users on SCMP’s account behave and then find opportunities on how SCMP can capitalize on its social media strategy.
This article is divide in the following sections:
- Data Extraction and Transformation
- Tools
- Overview
- The alternative approach
- Social Network graphs
- Conclusion
- Recomendation
- Resources
To extract the data from Twitter, we used the package rtweet using the search term “SCMPNews”. The data was extract three times during the period from Dec/18 and Jan/19:
- First Extraction: 18th - December - 2018
- Second Extraction: 25th - December - 2018
- Third Extraction: 1st - January - 2019
The resulting data was merged, and duplicated tweets were excluded. Besides that, from the original data extracted, the team broke it down into three different component files that was used in the further analysis:
- Vertex data: File with information about the user account like number of friends, number of tweets, account language and others.
- Edge data: File with user’s relationship trough tweets, like: retweet, reply, mention and others.
- Tweet Data: Is the original file extract using the Twitter API.
For our analysis, we had the option of choosing one of several software;
- R
- NodeXL
- Gephi
- Python
NodeXL had the capability of handling most of the social network analysis we wanted to do. However, due to its limitations as an Excel plug-in in handling large amounts of data, we focused on using R for data extraction, time-series, topic modeling, natural language processing and cluster analysis. Python was used for data wrangling, and sentiment analysis under natural language processing. Gephi was used to perform social network analysis based on the extracted data.
In our final data we have 25.372 uniques tweets, 13.267 users and 50.157 relarionships (We only considered for this analysis tweets in english) In average there are 1103 tweets per day in the data, but we clearly see that on the 18th of December, there were some event that made the amount of tweets increase dramatically. After some analysis we figure out that the trending topic at that day was the “USA and China Trade war”. You can see more details about this in Topic Cluster and User cluster article.
In this context, we know that SCMP is the primary source of content, so it is reasonable to expect that its social network would be built around its profile. However, what we could see from its graph is that the power of this account is so high that is even hard to identify relevant sub-networks in its structure. For example, when we compare the betweenness centrality from the top users, we see that the difference between SCMP (top 1) and the other ones is huge (note that this metric is normalized to make easier the comparison):
| User1 | Betweenness Centrality | Degree |
|---|---|---|
| Top 1 | 0.98 | 6412 |
| Top 2 | 0.12 | 654 |
| Top 3 | 0.05 | 298 |
| Top 4 | 0.03 | 343 |
| Top 5 | 0.02 | 121 |
According to Borgatti et al. (2013)2, betweenness centrality “is a measure of how often a given node falls along the shortest path between two other nodes”. So, the high value obtained for SCMP node means that its profile is “in the middle” of most conversations (i.e edges) and that’s why is difficult to find sub-communities in this network, since most of the relationships are directly related to SCMP. After running the modularity clustering available in Gephi, we were able to find 21 different communities and the most important ones are highlighted in dark blue, yellow, light blue, red, black, purple and all the other (small) ones are in gray. As one can notice, comparing to the giant component in dark blue where SCMP node is in, none of the others sub-networks have high relevance.
To come up with more insightful analysis we decided to user an alternative approach by combining topic clustering, user clustering and sentiment analysis to split this giant network into meaningful sub-networks. If you understand more deeply those analysis you can go through the following articles: Topic Modeling, User Cluster, Sentiment Analysis, Social Network Analysis with Gephi
As said before, SCMP is a important player in Media field, so it is always producing content about several different subjects. Those topics are classified in their website as follow:
With this said, we decided to identify the natural topics that were commented in SCMP network. Notice that we are considering not only SCMP posts but also all the other users’ tweets, which means that the analysis is broader then just the content that SCMP produces alone, since it considers the engagement to the topics spread through its network. To achieve this goal, we used a popular topic model algorithm called Latent Dirichlet Allocation (available in topicmodel R package), which is an unsupervised method. As output of this modeling part we got 6 different topics as described below:
- USA and China: News related to USA and China relationship. With special attention to the Trade war issue that was quite discussed in this period and that’s why these words are bigger than the others.
- International: News related to the international scene. In this part of the cloud we can see the name of some countries like “Canada”, “Japan” and “India”. In this topic the principal news was related to the arrest of a top Huawei executive in Canada.
- Hong Kong News: In this topic we have many different topic news but all of them happened or had some relationship with Hong Kong.
- Mainland China: News inside the Mainland China. Some of the popular news in this topic is the ones about a group of Christians that were arrested during this period.
- Sports: In this case the news that generated most engagement were the ones related to some rumors of Japan Olympic games boycott because of the new Japanese police about Whale Hunting.
- Business: General news about business. In this cloud we can see some terms like: “Jack Ma”, “CEOs”, “wrapping” and “Christmas”.
Interestingly, we didn’t find all the contents that SCMP classify in its website, which means that not all of them produces enough engagement. Probably the non-identified topics reach a very specific kind of user and, comparing to the whole network, it doesn’t play a role important enough for us to identify it as an isolated topic.
We also developed a user clustering to understand the different kind of users in our data and how they interact with the different topics. For this task we used a K-mean algorithm, with 9 groups. The variables used in this model are:
- Percentage of tweets in each of the topics found with the previous model
- Number of followers
- Number of following profiles
- Number of Tweets. Is this case is the total number of tweets that the users posted in his account.
- Average sentiment. According to the sentiment analysis that will be explained in the next session.
- Number of tweets in the period.
All variables were normalized before running the K-means algorithm and the best solution that we got had 9 groups:
Our result shows that we have some groups that are specialized in one of the topics obtained with the topic modeling part. Which means that during the period in our data their interaction was more focused in one of the topics when comparing to the other ones. For example, the cluster “Hong Kong News - Neutral sentiment” has more posts in the topic “Hong Kong News” than in the other ones. Regardless the topic variable, the users from the specialized clusters (“Hong Kong News - Neutral sentiment”, “China News with negative sentiment”, “Sports News with negative sentiment”, “International News with negative sentiment”, “Entertainment News”, “USA and China News with negative sentiment”) are similar in terms of account information. They have similar average number of followers, tweets and account age, and comparing to the other clusters, they are smaller. In summary, they are just average users with more interest in one of modeled topics than the others.
The other three groups (“Small users very active in this context”, “Big Accounts”, “Heavy tweet users”) are more generic in terms of topic interest. During the period they didn’t show any specialization in some of the topics, they rather talk/ interact with all of them. The difference between them is the size of the account: the group “Small users very active in this context” have the same characteristic from the specialized groups regarding the account size, but they have a different behavior in this context. They are very active and interested in all kind of news and they have the second higher number of tweets per user (average). The other two are more likely to be organization profiles (like SCMP) or influencers, like politicians and journalists.
Finally, having those user groups defined gives a better chance for targeting users with content that they are more likely to engage. This will lead to a greater likelihood for them to follow through visiting SCMP’s website for more news; which is way to SCMP capitalize its social media strategy.
Another interesting topic in social media is Sentiment Analysis and specially in the context of news, this analysis helps understand how users react to each topic. It can also help to understand how the opinion is spread along the relationships trough the network. But even more interestingly is the agreement analysis, which shows the extent to which the sentiment of the users matches the sentiments of the original news posts. This kind of analysis in this context can be used as a means of the assessing how peeople perceive the crediblity of a news agency.
As we can see from the cloud bellow, some words like: “american”, “propaganda”, “trade”, “war” and “China” are very frequent in posts classified as negative. While words like: “Jack Ma”, “founder”, “CEOs” and “bravo” are frequent in positive posts. Given the time-frame context and the fact that SCMP is a Chinese News paper targeting readers from accross the world, this word cloud shows that SCMP is objective in report its news. Under regular circumstances news on the trade war between China and the United States of America, during the time when tension was highest, would be expected to have negative sentiments. And this is reflected in the negative words in word cloud. So this shows objectivity when it comes to the reporting of SCMP’s news.
Our expectation from an agreement analysis is that on average, the sentiment scores of SCMP’s posts would match the sentiment scores of the the replies to SCMP’s posts. What we see in the agreement visualization below is that majority of SCMP’s tweets had neutral sentiments, while the negative sentimented tweets were more than the positive sentimented tweets. On the other hand the replies also had neutral sentiments with the remaining ones being evenly distributed among the positive and negative tweets. What this means is that there is a good level of agreement on average in the replies to the posts that SCMP put out. This indicates a good level of trust in SCMP’s users and this is positive for any news company.
Using some statistical models together with social network theory we were able to identify some patterns in data and understand more deeply the users involved in this network. Over the course of the analysis, our discussions and insights have led to the following conclusions.
- Negative news tends to have a bigger outreach twitter engagements.
- The SCMP Network is very concentrated around SCMP profile. More than 80% of all users are classified in the network that SCMP is in and most of users are direct neighbors.
- There are many disconnected sub-networks in the data that are connected to the giant component by some specific user. Those disconnected groups are formed by users with specific topic interest.
- Those users are also influencers in the network since their followers tend to replicate their sentiment.
We believe that SCMP can take opportunity of this knowledge to create prospect strategies and expand even more their network, by targeting the isolated groups using the right topic with the right perspective they can get the engagement form this users.
Twitter Icon made by Freepik from www.flaticon.com is licensed by CC 3.0 BY
Silge, J. & Robison, D. 2010. Text Mining with R - Chapter 6
Gephi wiki
Borgatti, S., Everett, M., Johnson, J. 2013. Analyzing Social Networks. 1st ed., SAGE
6 Social Network Graphs
After modeling the topic, user cluster and sentiment, we were able to put all those pieces together and split the data into sub - networks. To generate each of these visualizations we split the data according to the tweet topic classification (i.e edge) and then applied the color to the node’s correspondent to the user cluster and the edges colors are the sentiment classification.
Each of the sub-networks obtained are now relevant enough and meaningful, since we can easily see the kind of user and sentiment in each of them. Besides that, we can identify new brokers that wasn’t possible to identify in the first visualization. Which means that the influencers differ according to the context (topic) in the network.
So, based on this result some strategies can be tailored for each of those groups. For example, lets focused in USA and China News sub-network:
- We could identify other important users that have high betweenness centrality in the network (the nodes with higher diameter). They are in the middle of the shortest path between the isolated points and SCMP profile. It is interesting to notice that the sentiment from the “broker’s followers” replicates the broker’s sentiment spread previously, which means that their opinion is very important and in a certain way followed by the users connected with him.
Those isolated groups could be selected, for example, as target for prospect strategies. SCMP can try to reach them through advertisement to engage a direct conversation at first moment (convert them as followers). This will led them to engage more with SCMP content and eventually in the future they could be potential clients for its subscription.
We can identify the similar pattern in the following sub-networks: “Sports News”, “Mainland China News” and “Business News”.
In summary the information about each network can be used to design:
- Prospect strategies for the isolated groups (not directly connected with the giant component)
- Support to give the right context in each advertisement or marketing campaign run in the network. Since we already know the user’s characteristics and what they usually engage more.
- Improve user experience by customizing the SCMP website for registered users bases in the topics that they are more interested in.